Overview
This session doesn’t assume any prior knowledge of R, and introduces the basics. For some students this will include revision of material from stage 1. However we provide additional material for advanced students to test their knowledge and extend familiar skills.
To be clear, this repetition is intentional: we find most students will benefit from refreshing their knowledge at this stage in the course. Even if you are quite confident when using RStudio please read the worksheet carefully and complete all of the activities in the blue boxes.
Techniques covered
Using the RStudio interface
- Access RStudio at https://rstudio.plymouth.ac.uk
- Use the latest version of the Firefox web browser
- Tell R what to do in the Console pane
- See the Environment pane for stored data
- Use the Files pane to open code and data from a folder on the server
- Using a web browser to access the RStudio Server at Plymouth University.
If you’re using Windows or an older Mac we strongly recommend downloading Firefox and using that. If you have any issues with RStudio this is likely the first suggestion we will make.
When you log in to RStudio, you’ll be greeted with a screen that looks something like the image below.
RStudio on first opening
You can see three parts:
The Console - This is the large rectangle on the left. It is where you tell R what to do, and where R prints the answers to your questions.
The Environment - This is the rectangle on the top right. It is where R keeps a list of the data it knows about. It’s empty at the moment, because we haven’t given R any data yet.
The Files - This is the rectangle on the bottom right. It’s a bit like the File Explorer in Windows, or the Finder on a Mac. It shows you what files and folders R can see.
You should also be able to see that the two rectangles on the right have a number of other “tabs”. These work like tabs on a web browser.
The top rectangle has the tabs Environment and History. The History tab keeps a record of what you’ve recently typed into the Console. This can sometimes be useful.
The bottom rectangle has the tabs Files, Plots, Packages, Help, and Viewer. We’ll cover what these other tabs do later on.
Before you start
- Before starting you must run some R code to get set up.
- See the code tab or the exercise below.
# run an R script over the internet which will get you
# set up, and copy files you need to your home folder
source("https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R")
To get everyone off to the same start we have created a script that copies some files into your home folder on the RStudio server.
To run this script, we just copy and paste the following line into the Console:
source("https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R")
- Click on the Console pane.
- Copy-paste the following into the console:
source("https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R")
Your console should now look like this:
Press ↩︎ to run the code. If your console looks like the image below, then you are ready to start the session.
Using the workbooks
- Each session has an associated “workbook” file
- They end with the file extension ".rmd"
- These were copied to your home folder by the bootstrap script (above)
- Use them to complete the exercises in the worksheet
Each session has an associated “workbook” file which you will use to complete the exercises in the worksheet. The file you need for this session is called session-1.rmd.
If you click on the file it opens the workbook in a tab of a new pane, called the Source pane. It’s called the Source pane because statements written in the R language are often referred to as ‘R code’, which is shorthand for ‘R source code’. The Source pane allows you to write R code and explore your data.
Click on session-1.rmd in the Files pane.
If you’re able to open this file you are now ready to start the rest of the session.
What can R do?
- R is a multi-purpose tool
- It can do simple arithmetic, load data, make plots etc.
- It can also run any statistical analysis you like
- You need to tell R exactly what to do, by providing precise instructions
- These instructions (code) provide a reproducible record of your work
# multiply two numbers
2 * 221
# generate some random numbers with a normal distribution
rnorm(10, 0,1)
# histogram plot of random numbers
hist(rnorm(100, 0,1))
R is a computer language for data analysis and visualisation.
RStudio is a user interface to R; it helps you organise your work.
R is a text-based language. You interact with it by typing commands and running (also called ‘executing’) them.
R can do everything from simple arithmetic and plotting to complex data analysis.
For example, you can do simple arithmetic like
2 * 221
[1] 442
We could generate some random numbers with a normal distribution
rnorm(10, 0,1)
[1] -0.30718194 0.05030063 -0.76868554 -1.27331824 2.14723562 1.68826929 -0.38580193 -0.46802252 0.49992883
[10] -0.51172558
And we could plot random numbers using a histogram:
hist(rnorm(100, 0,1))
You should think of R as a robot.
The robot is extremely fast, powerful and tireless; but it’s also literal-minded, and won’t think for itself or take the initiative. You need to tell it exactly what to do, by providing very precise instructions.
The advantage of writing detailed instructions is that you have a detailed, reproducible version of all your analyses.
Reproducibility is a key topic in psychology and other natural sciences — learning R (or something like it) is an important skill for new psychologists.
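To make the point about reproducibility concrete, here is a minimal sketch. It assumes we use set.seed(), a base R function that is not covered in this session, to fix the starting point of R's random number generator. The seed value 123 is an arbitrary choice.

```r
# Fix the random number generator's starting point
# (123 is an arbitrary value chosen for illustration)
set.seed(123)
first_run <- rnorm(10, 0, 1)

# Reset the same seed and repeat exactly the same instruction
set.seed(123)
second_run <- rnorm(10, 0, 1)

# Compare the two runs: identical() checks the values match exactly
identical(first_run, second_run)
```

Because the instructions are written down, anyone who runs them again in the same way should get the same numbers.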
Working interactively in R Markdown
- RMarkdown documents combine ‘chunks’ of R code with regular text
- RMarkdown files end with “.rmd” or “.Rmd”
- To make a chunk type: Ctrl + Alt + I (Windows/Linux) or ⌘ + Alt + I (Mac)
- To run a line type Ctrl + ↵ (Windows/Linux) or ⌘ + ↩︎ (Mac)
- To run part of a line, select it first, then use the same shortcut
- Anything outside a chunk is just narrative (ordinary) text and not treated as code.
Backticks:
On Windows
On a Mac
RMarkdown documents are a good way to use R
RMarkdown is a file format which combines R code (chunks) with regular text.
RMarkdown can combine data analysis and graphs with explanatory text.
[SHOW EXAMPLE OF AN R MARKDOWN DOCUMENT.]
In the finished document, the code is evaluated and the results are interspersed with text.
This allows us to make high quality reports, research papers, dissertations or books.
Because it’s such a powerful tool, this module provides an early introduction to RMarkdown, although we don’t introduce all its features just yet.
For the moment, we’ll only be using Rmd documents as an interactive interface for running R code and looking at the results R produces.
R Markdown documents can be used interactively in RStudio
One neat feature of Rmd files is that, when you open them in RStudio, they make it easy to organise and run R code, and see the outputs.
If you click on the lifesavr folder in the Files pane of RStudio,
[CLICK ON FILES FOLDER IN VIDEO.]
you’ll notice that some files have the extension .rmd. These are R Markdown files.
[HIGHLIGHT FILE EXTENSION BY SELECTING OR POINTING WITH MOUSE.]
The file extension .rmd (or .Rmd) is important, because this is how RStudio knows that the files contain a mixture of R code and regular text.
Code chunks
RStudio needs to distinguish R code from regular narrative text.
This is done by putting the code inside some special characters, creating a chunk.
A chunk is opened using the symbols ```{r}, and closed using the symbols ```. This is what a chunk looks like in RStudio:
A code chunk in the RMarkdown editor
NOTE: The symbols which start and end a chunk are backticks, not single quotes. The difference is quite subtle.
Backticks are on your keyboard here if you’re on Windows:
On Windows
Or here if you’re on a Mac:
On a Mac
Running R code inside chunks
There are three ways to run R code within a chunk.
The first is to run a complete line of code.
You can see here that our cursor is on line 12. The cursor can be anywhere on that line. To run the line, press Ctrl + ↵ on Windows or Linux, or ⌘ + ↩︎ on a Mac.
Pressing these keys has run or executed that line of code.
You’ll see some output beneath the chunk that you don’t need to worry about for now, but one of the effects of running this code is to load a dataset about diamonds (prices, sizes, quality, etc).
Now that line 12 has been run, the cursor has been pushed down to line 13.
Lines 13 to 15 are actually part of the same statement — that is, R knows they are related to one another.
We use the same keys, Ctrl + ↵, to run these lines. This generates a scatter plot using the diamonds dataset. Don’t worry how these statements work for now — the point here is to show you that we can run code interactively using Rmarkdown.
The second way to run code is to select only the parts you want to execute. If you select just the word diamonds on line 13 and run that, you will see that it does something different.
[SELECT DIAMONDS AND RUN IT.]
This prints the contents of the diamonds data. Because the dataset is large, it just prints the first few rows.
Finally, you often want to run all of the code in a chunk at once.
This can be done by pressing the green arrow on the right hand side of the chunk.
Another way to run all of the code is to position your cursor anywhere within the chunk and press Ctrl + ⇧ + ↵ (Windows, Linux) or ⌘ + ⇧ + ↩︎ (Mac).
Exercise 1
- Locate the first chunk in session-1.rmd (you find this in the Files pane)
- Place your cursor (anywhere) on the line that says library(tidyverse) (this code is explained in the next section)
- Run the code by pressing Ctrl + ↵ (Windows, Linux) or ⌘ + ↩︎ (Mac)
You will see some output appear beneath the chunk. Don’t worry about the details for now, we’ll explain those later.
Exercise 2
Position your cursor on the line that says diamonds and run the code.
You should see the following scatter plot of the diamonds data appear below the chunk:
Congratulations! You have just run your first lines of R. The code to produce the plot consisted of three lines. You can also run part of a line by highlighting just the code you want to run, as you’ll see in the next exercise.
Exercise 3
- Select (highlight) the word diamonds.
- Run the code.
This prints the first few lines of the diamonds data:
Example of running highlighted code
Exercise 4: Making new chunks
- Find the instructions for Exercise 4 in your workbook.
- Create a new chunk below the instructions.
- Inside the chunk, write a line of code which adds together the numbers 9, 4, 55 and 2.
- Run the line of code you have written.
The output from the chunk should look like this:
Result from Exercise 4
Loading packages
- Loading a ‘package’ adds functionality to R
- Some packages (like tidyverse and psydata) also include example datasets
- To load tidyverse write library(tidyverse)
- Load tidyverse and psydata before each session
The following R code is used in the video:
# load the tidyverse package
# (this also loads the diamonds example dataset, and some others)
library(tidyverse)
By loading ‘packages’, you can add extra functions and datasets to R.
Packages are a powerful feature which allow R to be extended. This means you can run almost any analysis, or make any type of plot.
Packages are loaded using the library() function.
The first function you ran above was library(tidyverse). This loaded additional functions you need to make a scatter plot.
The tidyverse package is so fundamental to this course that library(tidyverse) is likely to be the first line of R code, in the first chunk, in all your RMarkdown files.
It’s a good idea to load packages at the top of your R code files. This makes it easy to see which have been loaded, and avoids loading them twice which is occasionally a problem.
You do need to remember to actually run the lines of code to load libraries though. Beginners often forget to do this — but it’s an easy error to fix.
If you’ve understood what packages are then it should be clear that you can’t use the functions provided by tidyverse until you’ve run the command: library(tidyverse).
For example, if you tried to produce the scatter plot before loading tidyverse you’d see an error like this in the console:
Error in diamonds %>% ggplot(aes(carat, price, colour = clarity)) : could not find function "%>%"
This is important because could not find function errors are one of the most common problems that beginners encounter. They normally mean that you have
- forgotten to include library(tidyverse) as the first line in your code, or
- forgotten to run that line.
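If you would like to see a could not find function error safely, the sketch below triggers one on purpose and captures the message rather than stopping. The name not_a_real_function is made up for illustration, and tryCatch() and exists() are base R functions that this session does not otherwise cover.

```r
# Try to call a function that R does not know about.
# `not_a_real_function` is a made-up name used only for illustration.
message_text <- tryCatch(
  not_a_real_function(1),
  error = function(e) conditionMessage(e)
)

# Show the text of the error that R produced
print(message_text)

# Check whether something with a given name is currently available
exists("not_a_real_function")  # FALSE: nothing with this name has been loaded
exists("mean")                 # TRUE: mean() is part of base R
```

The same logic applies to %>% and ggplot(): until the package that provides them has been loaded, R has no function with that name.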
Datasets
Datasets are like spreadsheets. They have:
- multiple rows, with one row per observation
- multiple columns; each column has a name.
- columns also (sometimes) get called variables; this can be confusing
Where are datasets?
- R has some built-in datasets as learning examples
- The psydata package includes datasets used in this course
- Later on, we will import data from files (e.g. actual spreadsheets)
Exploring and checking data
- View a whole dataset by typing its name and running it in a code chunk
- glimpse() shows a list of all the columns, plus a few of the datapoints
- The Environment pane shows a spreadsheet-like view of the data
# always load the tidyverse first
library(tidyverse)
# the psydata package contains datasets for this course
library(psydata)
# display the `fuel` dataset, by typing the name
# and running this in a code chunk
fuel
# show only the first 6 rows of the `fuel` data
head(fuel)
# shows a list of columns in the `development` dataset
# plus the first few datapoints (as many as will fit)
glimpse(development)
Datasets are like spreadsheets: they are organised into columns and rows.
In R, datasets are normally stored in a container called a data.frame. They can also be stored in a tibble (these are basically the same thing).
Columns
Each column in a dataset has a name.
We sometimes call the columns variables, because each column will often relate to a variable in our study.
However, this can be a bit confusing because — in R — variables can actually contain whole datasets. fuel, for example, is the name of a variable which contains an example dataset, provided by the psydata package.
[SHOW LIBRARY(PSYDATA) AND THEN THE FUEL DATASET]
But these words are used flexibly and interchangeably, so we’ll just have to get used to it. It’s normally clear which type of variable we mean from the context.
Rows
Each row in a dataset represents an observation.
In different datasets an observation might correspond to an individual participant, a whole country, or even just a single button press in an experiment.
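As a small illustration of rows and columns, the sketch below builds a tiny data.frame by hand. The dataset name reaction_times, its column names and all of its values are invented for this example; they are not part of psydata.

```r
# A tiny made-up dataset: one row per participant (all values are invented)
reaction_times <- data.frame(
  participant = c("p1", "p2", "p3"),
  age = c(19, 22, 20),
  rt_ms = c(512, 478, 530)
)

nrow(reaction_times)   # number of rows (observations)
ncol(reaction_times)   # number of columns
names(reaction_times)  # the column names

# `reaction_times` is itself a variable that holds the whole dataset
reaction_times
```

Here each row is one observation (a participant) and each column is one thing we recorded about them.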
Packaged datasets
Some datasets are built into R packages as examples for beginners.
For this course, we created a package called psydata which includes the data we need for teaching.
This is installed on the RStudio server. To load it we run:
library(psydata)
We can see from the loading message that one of the datasets is called fuel. This contains data about cars: things like weight, fuel economy, engine size.
Let’s display this data using a new chunk. If we type the word fuel, select this variable name with our cursor, and ‘execute’ it, we can see the data it contains:
fuel
mpg cyl engine_size power weight gear automatic
1 21.0 6 2620 110 1188 4 TRUE
2 21.0 6 2620 110 1304 4 TRUE
3 22.8 4 1770 93 1052 4 TRUE
4 21.4 6 4230 110 1458 3 FALSE
5 18.7 8 5900 175 1560 3 FALSE
6 18.1 6 3690 105 1569 3 FALSE
7 14.3 8 5900 245 1619 3 FALSE
8 24.4 4 2400 62 1447 4 FALSE
9 22.8 4 2310 95 1429 4 FALSE
10 19.2 6 2750 123 1560 4 FALSE
11 17.8 6 2750 123 1560 4 FALSE
12 16.4 8 4520 180 1846 3 FALSE
13 17.3 8 4520 180 1692 3 FALSE
14 15.2 8 4520 180 1715 3 FALSE
15 10.4 8 7730 205 2381 3 FALSE
16 10.4 8 7540 215 2460 3 FALSE
17 14.7 8 7210 230 2424 3 FALSE
18 32.4 4 1290 66 998 4 TRUE
19 30.4 4 1240 52 733 4 TRUE
20 33.9 4 1170 65 832 4 TRUE
21 21.5 4 1970 97 1118 3 FALSE
22 15.5 8 5210 150 1597 3 FALSE
23 15.2 8 4980 150 1558 3 FALSE
24 13.3 8 5740 245 1742 3 FALSE
25 19.2 8 6550 175 1744 3 FALSE
26 27.3 4 1290 66 878 4 TRUE
27 26.0 4 1970 91 971 5 TRUE
28 30.4 4 1560 113 686 5 TRUE
29 15.8 8 5750 264 1438 5 TRUE
30 19.7 6 2380 175 1256 5 TRUE
31 15.0 8 4930 335 1619 5 TRUE
32 21.4 4 1980 109 1261 4 TRUE
By default this shows the first ten rows and columns of the data. You can see other rows using the Next, Previous and number buttons below the data.
If your browser window is very narrow you may need to view some of the columns by using the arrow next to the final, right-hand column.
You can get information about the columns in all these example datasets by typing: help(name_of_the_dataset_you_want_to_know_about). For example:
help(fuel)
No documentation for 'fuel' in specified packages and libraries:
you could try '??fuel'
Exploring and checking data
There are three ways we recommend to inspect and check data you are using.
- Typing the name of the dataset, and running that as code
- The glimpse() function, which shows a list of all the columns and some of the data
- The head() function, which shows the first 6 rows
To use glimpse:
glimpse(fuel)
Rows: 32
Columns: 7
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4
$ engine_size <dbl> 2620, 2620, 1770, 4230, 5900, 3690, 5900, 2400, 2310, 2750, 2750, 4520, 4520, 4520, 7730, 7540, 72…
$ power <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, …
$ weight <dbl> 1188, 1304, 1052, 1458, 1560, 1569, 1619, 1447, 1429, 1560, 1560, 1846, 1692, 1715, 2381, 2460, 24…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4
$ automatic <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
This shows a list of all the columns in the dataset, the type of data stored in each column, and as many observations (datapoints) as will fit on a single line.
glimpse is a really useful view to check which columns are available in a dataset before using them.
Using head:
head(fuel)
mpg cyl engine_size power weight gear automatic
1 21.0 6 2620 110 1188 4 TRUE
2 21.0 6 2620 110 1304 4 TRUE
3 22.8 4 1770 93 1052 4 TRUE
4 21.4 6 4230 110 1458 3 FALSE
5 18.7 8 5900 175 1560 3 FALSE
6 18.1 6 3690 105 1569 3 FALSE
This prints the first 6 rows of all the columns. head() is really useful for checking the actual datapoints are as you expect before using them.
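If you want a different number of rows, head() accepts an n argument. The sketch below uses mtcars, a dataset built into R, as a stand-in for fuel so that it runs without loading psydata; that substitution is our assumption for this example.

```r
# mtcars is a dataset built into R; it stands in for `fuel` here
first_six <- head(mtcars)          # no n given, so head() uses its default
first_ten <- head(mtcars, n = 10)  # ask for 10 rows instead

# Count how many rows each version returned
nrow(first_six)
nrow(first_ten)
```

With psydata loaded, the same pattern should work for the course datasets, for example head(fuel, n = 10).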
Why are we talking about cars not psychology?
In this course we mostly use very simple datasets, and some of them aren’t even about psychology.
Some students ask why we don’t always use psychological examples. If this hasn’t troubled you then you could skip to the next section, but we thought we should explain:
We think the fuel dataset (and others, like iris, and development) have a number of benefits.
First, they are either built into R, loaded in common packages, or available in the psydata package. This makes them easily available for everyone.
Second, these data relate to concrete, easy to understand phenomena (e.g. weight, length, number of gears). This means you don’t have to hold in mind any complex psychological/theoretical ideas for the examples to make sense.
Third, the relationships in these datasets are clear, and there aren’t too many data points. Real data are often more messy because many psychological constructs are hard to measure.
Our experience is that, when learning R, it pays to keep everything as simple as it possibly can be. The skills and concepts involved in analysing these data are the same though.
R, and the techniques and statistics we teach, are used right across the natural sciences.
[ TODO show examples of plots and analyses here]
If you’re still not convinced — don’t worry … we do include some clinical examples, and we will be collecting our own psychological data soon enough and analysing that.
Exercise 5
- Create a new chunk below the Exercise 5 heading in your workbook (session-1.rmd).
- Load the psydata package
- Display the fuel dataset and try out the navigation buttons.
- Make a list of columns in the development dataset
The output should look like this:
The fuel dataset
Columns in the development dataset
Exercise 6
- Create a new chunk below the Exercise 6 heading in your workbook (session-1.rmd).
- Load the psydata package if you haven’t already done that in this work session
- Show the first 10 rows of the development data
Use the output to answer the following question. After entering your answer, click outside the box. The border will turn blue when the answer is correct.
The population of Afghanistan in 1967 was: .
Making a scatterplot with ggplot()
- A scatterplot shows the relationship between two continuous variables (columns)
- Each observation (row) must have at least two values (columns)
- These define the position of a point on the x and y axes of the plot
- Use ggplot()
- aes(x = ..., y = ...) chooses the x and y data columns and creates the axes
- geom_point() adds the points
# if you have not already, load these packages
library(tidyverse)
library(psydata)
# make a scatterplot from the fuel dataset
fuel %>%
ggplot(aes(x=weight, y=mpg)) + # selects the columns to use
geom_point() # adds the points to the plot
# the same plot
# this time we left out x= and y= in the aes code
# these are implicit in the order of weight and mpg
# the x axis comes first
fuel %>%
ggplot(aes(weight, mpg)) +
geom_point()
A scatterplot shows the relationship between two variables by plotting their values as points on an x axis (the left-right position) and y-axis (up down).
This code chunk creates a scatterplot. We start with the fuel data.
The %>% symbol is special, it’s called a ‘pipe’. We’ll explain more later, but for now just know that it sends the fuel data on to the next line of code — like it’s passing it down a pipe.
fuel %>%
ggplot(aes(weight, mpg)) +
geom_point()
[SHOW THE CODE, SELECT THE PIPE WITH CURSOR]
The second line receives the data. The ggplot() function tells us we are going to be making a plot.
The plot itself is built in two steps. The first step, ggplot(aes(weight, mpg)) selects columns in our dataset to use for the x and y axes. In this case, the x axis is weight, which is the weight of the cars in kg.
[SELECT WEIGHT IN CODE]
And mpg is miles per gallon, or fuel efficiency. This will be the y-axis.
We can see the plot if we put our cursor on the first line and press the shortcut — Ctrl or Cmd + Enter
[RUN THE PLOT AND SHOW]
As you’ve seen before, if we run code in an Rmarkdown document then the result is shown underneath the chunk.
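To make the pipe described above a little more concrete, here is a minimal sketch. It assumes mtcars (a dataset built into R) as a stand-in for fuel, and uses head() rather than ggplot() so the two results are easy to compare.

```r
library(tidyverse)

# These two lines ask for the same thing: the pipe passes `mtcars`
# to head() as its first argument
piped <- mtcars %>% head()
direct <- head(mtcars)

# Compare the two results
identical(piped, direct)
```

In the same way, fuel %>% ggplot(...) hands the fuel data to ggplot() as the data to plot.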
Building plots in layers
A useful thing to know is that ggplot works by building up plots in multiple layers.
If we run just this part of the code, we can see the plot with just the axes, and no data shown.
[RUN JUST THE FIRST TWO LINES OF CODE BY SELECTING AND PRESSING CTRL+ENTER]
So, we make plots by:
- selecting data
- defining the axes, and then
- adding the data points
Each part of the plot is separated by a + symbol and goes on a new line.
RStudio is smart and knows all this is part of the same plot, so automatically indents the code.
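The layering idea can be sketched by saving each stage of the plot under its own name. This is only an illustration: the names axes_only and with_points are ours, and mtcars (built into R, with columns wt and mpg) stands in for fuel so the code runs without psydata.

```r
library(tidyverse)

# Steps 1 and 2: select the data and define the axes (no points yet)
axes_only <- mtcars %>%
  ggplot(aes(wt, mpg))

# Step 3: add a layer of points on top of the axes
with_points <- axes_only +
  geom_point()

axes_only    # draws the axes with no data shown
with_points  # draws the finished scatterplot
```

Saving intermediate steps like this is optional; in these guides we normally write the whole plot in one go.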
Cutting corners
There’s just one final thing to explain: In the previous code we wrote x = weight and y = mpg.
This makes things explicit, which is nice, but takes longer to type. You can also write the plot this way:
fuel %>%
ggplot(aes(weight, mpg)) +
geom_point()
R assumes that the first variable is the x axis and the second is the y axis.
[SELECT X AXIS AND Y AXIS IN TURN WHEN DESCRIBING]
We use this style in these guides, and you should too.
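If you want to convince yourself that the short and long forms mean the same thing, one way is to look at the names aes() assigns. This sketch uses wt and mpg, columns of the built-in mtcars dataset, purely as example column names; the variable names explicit and implicit are ours.

```r
library(tidyverse)

# The long form names the axes; the short form relies on their order
explicit <- aes(x = wt, y = mpg)
implicit <- aes(wt, mpg)

# Both versions label the first column as x and the second as y
names(explicit)
names(implicit)
```

Either form can be passed to ggplot(); the short one simply saves typing.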
Exercise 7
- Create a new chunk below the Exercise 7 heading in your workbook.
- Using the fuel dataset, create a scatterplot with engine_size on the x-axis and mpg (miles per gallon, or fuel economy) on the y-axis.
- Run the chunk.
The scatterplot should look like this:
Check your knowledge
Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 2.
- How do you run part of a line of R code using the keyboard shortcut?
- Which library will you always need to load in your first R Markdown chunk?
- What is psydata?
- How would you look at/inspect a whole dataset?
- What does glimpse() do and when is it useful?
- What is the 5th column in the development dataset?
- Which function makes a plot? (there are many, but we mean the one shown above)
- Which function chooses the columns of data used in the plot?
Extension exercises
Please remember that these extension exercises are not required to pass the course. We include them because we find that some students work through these materials much more quickly than others, perhaps because they have more previous experience with programming, and we aim to give all students the opportunity to stretch their skills.
If you do find you have extra time, however, these exercises are intended to provide additional practice in the techniques taught here, and to be useful preparation for using R independently in a stage 4 or MSc research project.
Extension exercise 1
This scatterplot uses the fuel dataset to show a vehicle’s power on the x-axis against mpg on the y-axis.
In a new chunk, write the R code to produce this plot.
Extension exercise 2
There is another built-in dataset called iris which includes data about different flower species.
Use glimpse() to get a list of the column names.
Make a scatterplot which shows the relationships between petal widths and lengths.
Further reading
Scatterplots and visualisation: Fundamentals of Data Visualization is an excellent resource for data visualisation in R. This chapter: https://clauswilke.com/dataviz/visualizing-associations.html shows many examples of plots which display relationships between variables (including scatter plots) which would extend the material here.